
    Rethinking the Evaluation of Unbiased Scene Graph Generation

    Because predicate distributions in common subject-object relations are severely imbalanced, current Scene Graph Generation (SGG) methods tend to predict frequent predicate categories and fail to recognize rare ones. To improve the robustness of SGG models across predicate categories, recent research has focused on unbiased SGG and adopted mean Recall@K (mR@K) as the main evaluation metric. However, we identify two overlooked issues with this de facto standard metric that make current unbiased SGG evaluation vulnerable and unfair: 1) mR@K neglects the correlations among predicates and unintentionally breaks category independence by ranking all triplet predictions together regardless of predicate category, which causes the performance of some predicates to be underestimated. 2) mR@K neglects the compositional diversity of different predicates and assigns excessively high weights to samples from some overly simple categories that have few composable relation triplet types. This conflicts with the goal of the SGG task, which encourages models to detect more types of visual relationship triplets. In addition, we investigate the under-explored correlation between objects and predicates, which can serve as a simple but strong baseline for unbiased SGG. In this paper, we refine mR@K and propose two complementary evaluation metrics for unbiased SGG: Independent Mean Recall (IMR) and weighted IMR (wIMR). These two metrics are designed by considering category independence and the diversity of composable relation triplets, respectively. We compare the proposed metrics with the de facto standard metrics through extensive experiments and discuss how to evaluate unbiased SGG in a more trustworthy way.
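To make the metric discussed in this abstract concrete, the following is a minimal sketch of Recall@K and mean Recall@K for a single image. It is a simplified illustration, not the paper's evaluation code: the function names, the toy triplets, and the single-image setup are assumptions, and the proposed IMR and wIMR metrics are not implemented because the abstract does not define them in detail.

```python
from collections import defaultdict


def recall_at_k(ranked_preds, gt_triplets, k):
    """Fraction of ground-truth triplets found among the top-k predictions."""
    top_k = set(ranked_preds[:k])
    hits = sum(1 for triplet in gt_triplets if triplet in top_k)
    return hits / len(gt_triplets)


def mean_recall_at_k(ranked_preds, gt_triplets, k):
    """Average of per-predicate recalls, all computed from ONE shared top-k list."""
    top_k = set(ranked_preds[:k])
    hits = defaultdict(int)
    totals = defaultdict(int)
    for subj, pred, obj in gt_triplets:
        totals[pred] += 1
        if (subj, pred, obj) in top_k:
            hits[pred] += 1
    per_predicate = {p: hits[p] / totals[p] for p in totals}
    return sum(per_predicate.values()) / len(per_predicate), per_predicate


# Toy data (hypothetical): three frequent "on" triplets and one rare "riding" triplet.
gt = [
    ("cup", "on", "table"),
    ("hat", "on", "man"),
    ("book", "on", "shelf"),
    ("man", "riding", "horse"),
]
# Predictions sorted by confidence; the rare predicate is ranked last.
preds = [
    ("cup", "on", "table"),
    ("hat", "on", "man"),
    ("book", "on", "shelf"),
    ("man", "on", "horse"),
    ("man", "riding", "horse"),
]

r_at_4 = recall_at_k(preds, gt, k=4)
mr_at_4, per_predicate = mean_recall_at_k(preds, gt, k=4)
print(r_at_4, mr_at_4, per_predicate)
```

In this toy case the "riding" triplet is predicted correctly but falls outside the shared top-4 list because "on" predictions occupy it, which is one way to picture the first issue the abstract raises about ranking all predicates together.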

    Boundary Proposal Network for Two-Stage Natural Language Video Localization

    We aim to address the problem of Natural Language Video Localization (NLVL): localizing the video segment that corresponds to a natural language description in a long, untrimmed video. Almost all state-of-the-art NLVL methods follow a one-stage design and can typically be grouped into two categories: 1) the anchor-based approach, which first pre-defines a series of video segment candidates (e.g., by sliding window) and then classifies each candidate; 2) the anchor-free approach, which directly predicts, for each video frame, the probability of being a boundary or an intermediate frame inside the positive segment. However, both kinds of one-stage approaches have inherent drawbacks: the anchor-based approach is susceptible to its heuristic rules, which limits its ability to handle videos of varying length, while the anchor-free approach fails to exploit segment-level interaction and therefore achieves inferior results. In this paper, we propose a novel Boundary Proposal Network (BPNet), a universal two-stage framework that avoids the issues mentioned above. Specifically, in the first stage, BPNet uses an anchor-free model to generate a group of high-quality candidate video segments with their boundaries. In the second stage, a visual-language fusion layer is proposed to jointly model the multi-modal interaction between the candidate and the language query, followed by a matching score rating layer that outputs the alignment score for each candidate. We evaluate BPNet on three challenging NLVL benchmarks (i.e., Charades-STA, TACoS and ActivityNet-Captions). Extensive experiments and ablation studies on these datasets demonstrate that BPNet outperforms the state-of-the-art methods. Comment: AAAI 202
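The two-stage idea in this abstract can be pictured with a small sketch: stage one turns per-frame boundary probabilities into candidate segments, and stage two re-ranks those candidates with a matching score. Everything below is an illustrative assumption rather than BPNet itself: the function names, the product-of-probabilities scoring, the toy probabilities, and the stand-in `toy_match_score` (which replaces the learned visual-language fusion and matching layers) are all hypothetical.

```python
def propose_segments(start_probs, end_probs, top_n=3):
    """Stage 1 (sketch): pair every start frame with every later end frame and
    keep the top_n pairs by the product of their boundary probabilities."""
    candidates = []
    for s, p_start in enumerate(start_probs):
        for e in range(s + 1, len(end_probs)):
            candidates.append((p_start * end_probs[e], s, e))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [(s, e, score) for score, s, e in candidates[:top_n]]


def rerank(candidates, match_score):
    """Stage 2 (sketch): order candidates by a segment-level matching score."""
    return sorted(candidates, key=lambda c: match_score(c[0], c[1]), reverse=True)


def toy_match_score(start, end):
    """Hypothetical stand-in for a learned query-segment alignment score:
    it simply prefers segments that are three frames long."""
    return -abs((end - start) - 3)


# Toy per-frame boundary probabilities for a five-frame video (made up).
start_probs = [0.1, 0.8, 0.3, 0.1, 0.1]
end_probs = [0.1, 0.1, 0.2, 0.9, 0.4]

proposals = propose_segments(start_probs, end_probs, top_n=3)
proposal_spans = [(s, e) for s, e, _ in proposals]
best = rerank(proposals, toy_match_score)[0]
print(proposal_spans, best[:2])
```

The point of the split is that the second stage sees whole segments, so it can prefer a candidate that the frame-level boundary scores alone ranked lower.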

    Rethinking Multi-Modal Alignment in Video Question Answering from Feature and Sample Perspectives

    Reasoning about causal and temporal event relations in videos is a new direction for Video Question Answering (VideoQA). The major obstacle to this goal is the semantic gap between language and video, since the two are at different levels of abstraction. Existing efforts mainly focus on designing sophisticated architectures while relying on frame- or object-level visual representations. In this paper, we reconsider the multi-modal alignment problem in VideoQA from the feature and sample perspectives to achieve better performance. From the feature perspective, we break the video down into trajectories and are the first to leverage trajectory features in VideoQA to enhance the alignment between the two modalities. Moreover, we adopt a heterogeneous graph architecture and design a hierarchical framework to align both trajectory-level and frame-level visual features with language features. In addition, we find that VideoQA models depend heavily on language priors and often neglect visual-language interactions. From the sample perspective, we therefore design two effective yet portable training augmentation strategies to strengthen the cross-modal correspondence ability of our model. Extensive results show that our method outperforms all state-of-the-art models on the challenging NExT-QA benchmark, which demonstrates the effectiveness of the proposed method.

    Efficient Delivery of Curcumin by Alginate Oligosaccharide Coated Aminated Mesoporous Silica Nanoparticles and In Vitro Anticancer Activity against Colon Cancer Cells

    We designed and synthesized aminated mesoporous silica (MSN-NH2) and grafted alginate oligosaccharides (AOS) onto its surface to obtain MSN-NH2-AOS nanoparticles as a delivery vehicle for the fat-soluble model drug curcumin (Cur). Dynamic light scattering, thermogravimetric analysis, and X-ray photoelectron spectroscopy were used to characterize the structure and performance of MSN-NH2-AOS. The nano-MSN-NH2-AOS preparation process was optimized, and the drug loading and encapsulation efficiencies of nano-MSN-NH2-AOS were investigated. The encapsulation efficiency of the MSN-NH2-Cur-AOS nanoparticles was up to 91.24 ± 1.23%. With the pH-sensitive AOS coating, the total release rate of Cur was only 28.9 ± 1.6% under neutral conditions, compared with 67.5 ± 1% under acidic conditions. According to in vitro anti-tumor studies using MTT and cellular uptake assays, the MSN-NH2-Cur-AOS nanoparticles were taken up more readily by colon cancer cells than free Cur, achieving a high tumor cell targeting efficiency. Moreover, when the concentration of Cur reached 50 μg/mL, MSN-NH2-Cur-AOS nanoparticles showed strong cytotoxicity against tumor cells, indicating that MSN-NH2-AOS might be a promising carrier for fat-soluble anticancer drugs.